A quotient filter, introduced by Bender ''et al.'' in 2011, is a space-efficient probabilistic data structure used to test whether an element is a member of a set (an approximate member query filter, AMQ). A query elicits a reply specifying either that the element is definitely not in the set or that the element is probably in the set. The former result is definitive; ''i.e.'', the test does not generate false negatives. With the latter result, however, there is some probability, ε, of the test returning "element is in the set" when in fact the element is not present in the set (''i.e.'', a false positive). There is a tradeoff between ε, the false positive rate, and storage size; increasing the filter's storage size reduces ε. Other AMQ operations include "insert" and, optionally, "delete". The more elements are added to the set, the larger the probability of false positives.

A typical application for quotient filters, and other AMQ filters, is to serve as a proxy for the keys in a database on disk. As keys are added to or removed from the database, the filter is updated to reflect this. Any lookup first consults the fast quotient filter, then looks in the (presumably much slower) database only if the quotient filter reported the presence of the key. If the filter reports absence, the key is known not to be in the database, and no disk access is needed.

A quotient filter has the usual AMQ operations of insert and query. In addition, it can be merged and resized without having to re-hash the original keys (thereby avoiding the need to access those keys from secondary storage). This property benefits certain kinds of log-structured merge-trees.

==Algorithm description==
The quotient filter is a ''compact'' hash table. Cleary defines a compact hash table as one in which the table entries contain only a portion of the key plus some additional metadata bits.
These bits are used to deal with the case when distinct keys happen to hash to the same table entry. By way of contrast, other types of hash tables that deal with such collisions by linking to overflow areas are not compact, because the overhead due to linkage can exceed the storage used to store the key.

In a quotient filter a hash function generates a ''p''-bit fingerprint. The ''r'' least significant bits are called the remainder, while the ''q'' = ''p'' - ''r'' most significant bits are called the quotient, hence the name ''quotienting'' (coined by Knuth). The hash table has 2<sup>''q''</sup> slots.

For some key ''d'' which hashes to the fingerprint ''d<sub>H</sub>'', let its quotient be ''d<sub>Q</sub>'' and its remainder be ''d<sub>R</sub>''. The quotient filter (QF) tries to store the remainder in slot ''d<sub>Q</sub>'', which is known as the ''canonical slot''. However, the canonical slot might already be occupied, either because multiple keys can hash to the same fingerprint (a ''hard collision'') or because, even when the keys' fingerprints are distinct, they can have the same quotient (a ''soft collision''). If the canonical slot is occupied, the remainder is stored in some slot to the right.

As described below, the insertion algorithm ensures that all fingerprints having the same quotient are stored in contiguous slots. Such a set of fingerprints is defined as a ''run''. Note that a run's first fingerprint might not occupy its canonical slot if the run has been forced right by some run to the left.

However, a run whose first fingerprint occupies its canonical slot indicates the start of a ''cluster''. The initial run and all subsequent runs comprise the cluster, which terminates at an unoccupied slot or the start of another cluster.

The three additional metadata bits stored in each slot are used to reconstruct a slot's fingerprint. They have the following function:
* is_occupied is set when a slot is the canonical slot for some key stored (somewhere) in the filter (but not necessarily in this slot).
* is_continuation is set when a slot is occupied but not by the first remainder in a run.
* is_shifted is set when the remainder in a slot is not in its canonical slot.

The various combinations have the following meaning:

{| class="wikitable"
! is_occupied !! is_continuation !! is_shifted !! Meaning
|-
| 0 || 0 || 0 || Empty slot.
|-
| 0 || 0 || 1 || Slot is holding the start of a run that has been shifted from its canonical slot.
|-
| 0 || 1 || 0 || Not used.
|-
| 0 || 1 || 1 || Slot is holding the continuation of a run that has been shifted from its canonical slot.
|-
| 1 || 0 || 0 || Slot is holding the start of a run that is in its canonical slot.
|-
| 1 || 0 || 1 || Slot is holding the start of a run that has been shifted from its canonical slot. Also, the run for which this is the canonical slot exists but is shifted right.
|-
| 1 || 1 || 0 || Not used.
|-
| 1 || 1 || 1 || Slot is holding the continuation of a run that has been shifted from its canonical slot. Also, the run for which this is the canonical slot exists but is shifted right.
|}
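
The quotienting step and the meaning of the three metadata bits can be illustrated with a short Python sketch. It is not taken from the original paper or from any particular implementation: the function names (<code>fingerprint</code>, <code>split_fingerprint</code>, <code>describe_slot</code>) are hypothetical, and the use of SHA-256 as the hash function is an assumption made only so that the example is self-contained.

```python
import hashlib
from typing import Tuple


def fingerprint(key: bytes, p: int) -> int:
    """Return a p-bit fingerprint of key.

    Assumption for illustration: the low p bits of a SHA-256 digest.
    Any hash with well-distributed output could be used instead.
    """
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest, "big") & ((1 << p) - 1)


def split_fingerprint(fp: int, r: int) -> Tuple[int, int]:
    """Split a fingerprint into (quotient, remainder).

    The r least significant bits form the remainder; the remaining
    q = p - r most significant bits form the quotient, which is the
    index of the canonical slot.
    """
    quotient = fp >> r
    remainder = fp & ((1 << r) - 1)
    return quotient, remainder


# Meaning of (is_occupied, is_continuation, is_shifted), following the
# table above. Combinations missing from this dict are not used.
SLOT_STATES = {
    (0, 0, 0): "empty slot",
    (0, 0, 1): "start of a run shifted from its canonical slot",
    (0, 1, 1): "continuation of a run shifted from its canonical slot",
    (1, 0, 0): "start of a run in its canonical slot",
    (1, 0, 1): "start of a shifted run; this slot's own run is shifted right",
    (1, 1, 1): "continuation of a shifted run; this slot's own run is shifted right",
}


def describe_slot(is_occupied: int, is_continuation: int, is_shifted: int) -> str:
    """Look up the meaning of one slot's three metadata bits."""
    key = (is_occupied, is_continuation, is_shifted)
    return SLOT_STATES.get(key, "not used")


if __name__ == "__main__":
    p, r = 8, 5                      # so q = 3 and the table has 2**3 = 8 slots
    fp = fingerprint(b"example-key", p)
    dq, dr = split_fingerprint(fp, r)
    print(f"fingerprint={fp:0{p}b} canonical slot={dq} remainder={dr:0{r}b}")
    print(describe_slot(1, 0, 0))
```

With ''p'' = 8 and ''r'' = 5, for example, the fingerprint 10110101 splits into the quotient 101 (canonical slot 5) and the remainder 10101. Only the remainder and the three metadata bits need to be stored, since the quotient is implied by the slot position once the run and cluster boundaries are known.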